Summary and context

This markdown document answers the questions shared by the DiDi hiring manager. All of them are answered with SQL and R scripts.

A localhost database was created to show the query results. Because of that, simply running the script on another computer will not reproduce the results.

I'll present each question and follow it with the SQL needed to answer it.

Challenge 1

Question 1

Write the SQL queries necessary to generate a list of the five restaurants that have the highest average number of visitors on holidays. The result table should also contain that average per restaurant.

IMPORTANT: I didn’t consider days without operation.

Summary

SELECT id as restaurant_id, avg(visitors) as avg_daily_visitors_on_holiday
FROM 
(SELECT rv.id, visit_date, sum(rv.reserve_visitors) as visitors
FROM restaurant_visitors rv
LEFT JOIN date_info di on rv.visit_date = di.calendar_date
WHERE di.holiday_flg = 1
GROUP BY id, visit_date) hrv
GROUP BY id
ORDER BY avg_daily_visitors_on_holiday desc
LIMIT 5
| restaurant_id | avg_daily_visitors_on_holiday |
|---|---|
| 0a74a5408a0b8642 | 33.27273 |
| e053c561f32acc28 | 30.63636 |
| db80363d35f10926 | 25.20000 |
| 42c9aa6d617c5057 | 18.85714 |
| 36429b5ca4407b3e | 18.44444 |

Additional Analysis

The top restaurant has an average of 33 daily customers on holidays:

top_five_rest_in_holiday.barplot

But in reality, the 4th and 5th places could swap if we used the median instead of the mean.

top_five_rest_in_holiday.boxplot

It's important to view the full distribution whenever a comparison of means is requested.

top_five_rest_in_holiday.densityplot
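To illustrate why a ranking can change between mean and median, here is a small Python sketch with made-up visitor counts (not the actual data): one restaurant wins on the mean while the other wins on the median.

```python
from statistics import mean, median

# Hypothetical daily holiday visitor counts (illustrative, not the real data)
restaurant_a = [10, 10, 10, 60]   # one big holiday inflates the mean
restaurant_b = [20, 20, 20, 20]   # perfectly steady

print(mean(restaurant_a), median(restaurant_a))  # 22.5 10
print(mean(restaurant_b), median(restaurant_b))  # 20 20
```

Restaurant A ranks first by mean but last by median: the mean is sensitive to a single outlier holiday, which is exactly why the full distribution matters.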

Question 2

Use SQL to discover which day of the week there are usually more visitors on average in restaurants.

IMPORTANT: I didn’t consider days without operation.

Total visitors

SELECT
day_of_week,
avg(visitors) as avg_visitors_by_weekday
FROM
(SELECT 
rv.visit_date, 
di.day_of_week, 
sum(reserve_visitors) as visitors
FROM restaurant_visitors rv
LEFT JOIN date_info di on rv.visit_date = di.calendar_date
GROUP BY visit_date, day_of_week) vrw
GROUP BY day_of_week
ORDER BY avg_visitors_by_weekday DESC
| day_of_week | avg_visitors_by_weekday |
|---|---|
| Saturday | 193.85246 |
| Friday | 181.08197 |
| Thursday | 114.50847 |
| Wednesday | 92.26230 |
| Sunday | 79.96667 |
| Monday | 79.69492 |
| Tuesday | 76.98246 |

Saturday is the weekday with the highest average number of customers (bar plot). But the median (blue points) is almost the same on Friday and Saturday.

weekday_w_more_visitors.barplot

Saturday's distribution tends to be higher, yet its median is almost the same as Friday's. The distribution is skewed to the right, which is why there is a big difference between the median and the mean.

weekday_w_more_visitors.boxplot

By restaurant

SELECT
day_of_week,
avg(visitors) as avg_visitors_by_store_by_weekday
FROM
(SELECT 
id, 
visit_date, 
day_of_week, 
sum(reserve_visitors) as visitors
FROM restaurant_visitors rv
LEFT JOIN date_info di on rv.visit_date = di.calendar_date
GROUP BY id, visit_date, day_of_week) vrw
GROUP BY day_of_week
ORDER BY avg_visitors_by_store_by_weekday DESC
| day_of_week | avg_visitors_by_store_by_weekday |
|---|---|
| Saturday | 17.06349 |
| Friday | 15.93939 |
| Thursday | 12.17297 |
| Wednesday | 12.02564 |
| Tuesday | 11.73262 |
| Monday | 10.56629 |
| Sunday | 10.54505 |

On average, Saturday is the weekday with the most visitors per restaurant (17). But again, the Friday and Saturday distributions are not that different: the median number of visitors per store on both days is 12. There are multiple outliers on all weekdays.

weekday_w_more_visitors_by_rest.barplot
weekday_w_more_visitors_by_rest.boxplot

Question 3

How was the percentage of growth of the amount of visitors’ week over week for the last four weeks of the data? You can solve this question using SQL or any other tool that you prefer. If you use other tools, please add your code or files.

Summary

WITH 
last_date as (
SELECT DATE_SUB(max(visit_date), INTERVAL 4*7 DAY)
FROM restaurant_visitors
),
weekly_visitors as (
SELECT 
DATE_SUB(visit_date, INTERVAL DAYOFWEEK(visit_date)-2 DAY) as visit_week, 
sum(reserve_visitors) as visitors
FROM restaurant_visitors rv
LEFT JOIN date_info di on rv.visit_date = di.calendar_date
GROUP BY visit_week
),
weekly_growth as (
SELECT 
visit_week,  
visitors,
visitors /
lag(visitors, 1) OVER (ORDER BY visit_week) - 1 as weekly_visitors_growth
FROM weekly_visitors
)

SELECT *
FROM weekly_growth
WHERE visit_week >= (select * from last_date)
| visit_week | visitors | weekly_visitors_growth |
|---|---|---|
| 2017-05-08 | 175 | 0.0670732 |
| 2017-05-15 | 78 | -0.5542857 |
| 2017-05-22 | 63 | -0.1923077 |
| 2017-05-29 | 15 | -0.7619048 |

There has been a continuous decrease in visitors over the last four weeks.

visitors_weekly_growth_percentage.lineplot
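The lag-based growth computed by the query can be reproduced in a few lines of Python, using the weekly totals from the table above (164, the total for the week before the displayed window, is implied by the reported 0.0670732 growth for 2017-05-08):

```python
# Weekly visitor totals; the leading 164 is the prior week's total,
# implied by 175 / 164 - 1 = 0.0670732 for 2017-05-08.
weeks = [164, 175, 78, 63, 15]

# Week-over-week growth: visitors / lag(visitors) - 1, as in the SQL window function
growth = [cur / prev - 1 for prev, cur in zip(weeks, weeks[1:])]
print([round(g, 7) for g in growth])
# [0.0670732, -0.5542857, -0.1923077, -0.7619048]
```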

Additional analysis

There’s a gap in 2016, from the weeks of Jul 25 to Sep 12. And it appears that the number of visitors drastically decreased from the week of Apr 17, 2017 onward.

Weekly visitors

weekly_visitors.lineplot

Monthly visitors

monthly_visitors.lineplot

Daily visitors

daily_visitors.lineplot

The number of visitors drastically decreased in the last month; it appears to have fallen back to the 2016 average. But the number of unique restaurants didn't decrease as much.

monthly_restaurants.lineplot

The average number of visitors by restaurant decreased in May 2017, but in general it is around 161 visitors.

monthly_avg_visitors_by_rest.lineplot

Question 4

Forecast for the next six months, after the last date of the data, the sum of visitors of all the restaurants and validate the accuracy of your forecast.

I created two forecasts and challenged them against each other:

  1. An ensemble model: the product of two forecasts, the number of restaurants and the average visitors by restaurant. This way the model can be more accurate and offers more control over the results when an action is known, e.g. a strategy to acquire more restaurants.
  2. A forecast built directly on the total number of visitors.

The model with the least MAE was used to forecast the next 6 months.

Forecast results

monthly_visitors_fcst.lineplot

The model shows that June is forecasted with the same bouncing effect that June 2016 had. Keeping close attention to the forecast and updating it when required is important so the process improves month over month.

Process to select the forecast

I'll show the results for the three time series:

  * Average visitors by restaurant
  * Number of restaurants
  * Number of visitors

Given the change of distribution in the behavior of the time series and the shortage of data, I created forecasts using ARIMA, Average, Naive, Seasonal Naive, and TBATS, and selected the best one by comparing the MAE of each model on training and test sets.

Models metrics

ARIMA

Applying auto.arima, p and q are 0. For average visitors by restaurant the series is purely white noise (d is 0). For restaurants it is a random walk (d is 1). For total visitors it is, again, a random walk (d is 1). Since average visitors is white noise, the point forecast will be the same as a mean method; and because restaurants and total visitors are random walks, the point forecast will be the same as a naive model. Reference: https://people.duke.edu/~rnau/411arim.htm
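These equivalences can be checked with a toy Python sketch (illustrative numbers, not the challenge data): for an ARIMA(0,0,0) series the best point forecast is the historical mean, while for an ARIMA(0,1,0) random walk it is simply the last observed value, i.e. a naive forecast.

```python
from statistics import mean

# Toy monthly series (illustrative only)
white_noise_like = [150, 170, 155, 162, 158]
random_walk_like = [5200, 5600, 5400, 5900, 5998]

# ARIMA(0,0,0) with non-zero mean -> point forecast = series mean
mean_forecast = mean(white_noise_like)

# ARIMA(0,1,0) -> point forecast = last value (naive method)
naive_forecast = random_walk_like[-1]

print(mean_forecast, naive_forecast)  # 159 5998
```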

Average visitors by restaurant
arima_models[[1]]
## Series: subseries 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##           mean
##       159.1783
## s.e.   30.3838
## 
## sigma^2 estimated as 13001:  log likelihood=-79.5
## AIC=163   AICc=164.2   BIC=164.13
arima_model_result[[1]][,'MAE']
## Training set     Test set 
##     78.20526     78.40482
Unique restaurants

The gap between the train and test set MAE is small here

arima_models[[2]]
## Series: subseries 
## ARIMA(0,1,0) 
## 
## sigma^2 estimated as 68:  log likelihood=-42.34
## AIC=86.69   AICc=87.09   BIC=87.17
arima_model_result[[2]][,'MAE']
## Training set     Test set 
##     3.846538     3.000000
Total visitors

The gap between train and test set MAE is really high

arima_models[[3]]
## Series: subseries 
## ARIMA(0,1,0) 
## 
## sigma^2 estimated as 4784777:  log likelihood=-109.31
## AIC=220.63   AICc=221.03   BIC=221.11
arima_model_result[[3]][,'MAE']
## Training set     Test set 
##     1255.147     2137.250
Average

Residuals on average visitors and restaurants are white noise, so I can rely on the results.

Average visitors by restaurant
avg_models[[1]]
##       Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Feb 2       159.1783 -1.299244 319.6558 -98.63475 416.9914
## Mar 2       159.1783 -1.299244 319.6558 -98.63475 416.9914
## Apr 2       159.1783 -1.299244 319.6558 -98.63475 416.9914
## May 2       159.1783 -1.299244 319.6558 -98.63475 416.9914
## Jun 2       159.1783 -1.299244 319.6558 -98.63475 416.9914
## Jul 2       159.1783 -1.299244 319.6558 -98.63475 416.9914
## Aug 2       159.1783 -1.299244 319.6558 -98.63475 416.9914
## Sep 2       159.1783 -1.299244 319.6558 -98.63475 416.9914
## Oct 2       159.1783 -1.299244 319.6558 -98.63475 416.9914
## Nov 2       159.1783 -1.299244 319.6558 -98.63475 416.9914
avg_model_result[[1]][,'MAE']
## Training set     Test set 
##     78.20526     78.40482
Unique restaurants
avg_models[[2]]
##       Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Feb 2       12.15385 -5.062737 29.37043 -15.50523 39.81292
## Mar 2       12.15385 -5.062737 29.37043 -15.50523 39.81292
## Apr 2       12.15385 -5.062737 29.37043 -15.50523 39.81292
## May 2       12.15385 -5.062737 29.37043 -15.50523 39.81292
## Jun 2       12.15385 -5.062737 29.37043 -15.50523 39.81292
## Jul 2       12.15385 -5.062737 29.37043 -15.50523 39.81292
## Aug 2       12.15385 -5.062737 29.37043 -15.50523 39.81292
## Sep 2       12.15385 -5.062737 29.37043 -15.50523 39.81292
## Oct 2       12.15385 -5.062737 29.37043 -15.50523 39.81292
## Nov 2       12.15385 -5.062737 29.37043 -15.50523 39.81292
avg_model_result[[2]][,'MAE']
## Training set     Test set 
##     10.67456     14.84615
Total visitors
avg_models[[3]]
##       Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Feb 2       2248.769 -1789.273 6286.811 -4238.493 8736.032
## Mar 2       2248.769 -1789.273 6286.811 -4238.493 8736.032
## Apr 2       2248.769 -1789.273 6286.811 -4238.493 8736.032
## May 2       2248.769 -1789.273 6286.811 -4238.493 8736.032
## Jun 2       2248.769 -1789.273 6286.811 -4238.493 8736.032
## Jul 2       2248.769 -1789.273 6286.811 -4238.493 8736.032
## Aug 2       2248.769 -1789.273 6286.811 -4238.493 8736.032
## Sep 2       2248.769 -1789.273 6286.811 -4238.493 8736.032
## Oct 2       2248.769 -1789.273 6286.811 -4238.493 8736.032
## Nov 2       2248.769 -1789.273 6286.811 -4238.493 8736.032
avg_model_result[[3]][,'MAE']
## Training set     Test set 
##     2130.107     3626.865
Naive

Residuals on average visitors and restaurants are not white noise, which means something underlying could be explained with more analysis. But I'll keep these results if they win the model challenge.

Average visitors by restaurant
naive_models[[1]]
##       Point Forecast      Lo 80    Hi 80      Lo 95     Hi 95
## Feb 2       206.8276   31.51611 382.1391  -61.28818  474.9434
## Mar 2       206.8276  -41.10028 454.7555 -172.34537  586.0005
## Apr 2       206.8276  -96.82080 510.4760 -257.56255  671.2177
## May 2       206.8276 -143.79537 557.4505 -329.40395  743.0591
## Jun 2       206.8276 -185.18080 598.8360 -392.69750  806.3527
## Jul 2       206.8276 -222.59608 636.2513 -449.91924  863.5744
## Aug 2       206.8276 -257.00299 670.6582 -502.54006  916.1952
## Sep 2       206.8276 -289.02815 702.6833 -551.51833  965.1735
## Oct 2       206.8276 -319.10685 732.7620 -597.51972 1011.1749
## Nov 2       206.8276 -347.55599 761.2112 -641.02892 1054.6841
naive_model_result[[1]][,'MAE']
## Training set     Test set 
##    102.85999     69.98649
Unique restaurants
naive_models[[2]]
##       Point Forecast      Lo 80    Hi 80      Lo 95    Hi 95
## Feb 2             29 18.4320551 39.56794  12.837723 45.16228
## Mar 2             29 14.0546689 43.94533   6.143089 51.85691
## Apr 2             29 10.6957824 47.30422   1.006115 56.99389
## May 2             29  7.8641101 50.13589  -3.324554 61.32455
## Jun 2             29  5.3693567 52.63064  -7.139950 65.13995
## Jul 2             29  3.1139273 54.88607 -10.589332 68.58933
## Aug 2             29  1.0398458 56.96015 -13.761366 71.76137
## Sep 2             29 -0.8906621 58.89066 -16.713823 74.71382
## Oct 2             29 -2.7038348 60.70383 -19.486831 77.48683
## Nov 2             29 -4.4187762 62.41878 -22.109608 80.10961
naive_model_result[[2]][,'MAE']
## Training set     Test set 
##     4.166667     3.000000
Total visitors
naive_models[[3]]
##       Point Forecast      Lo 80     Hi 80       Lo 95    Hi 95
## Feb 2           5998  3194.7170  8801.283  1710.74861 10285.25
## Mar 2           5998  2033.5592  9962.441   -65.08906 12061.09
## Apr 2           5998  1142.5714 10853.429 -1427.73723 13423.74
## May 2           5998   391.4340 11604.566 -2576.50277 14572.50
## Jun 2           5998  -270.3314 12266.331 -3588.58554 15584.59
## Jul 2           5998  -868.6130 12864.613 -4503.57830 16499.58
## Aug 2           5998 -1418.7897 13414.790 -5345.00098 17341.00
## Sep 2           5998 -1930.8817 13926.882 -6128.17811 18124.18
## Oct 2           5998 -2411.8490 14407.849 -6863.75416 18859.75
## Nov 2           5998 -2866.7592 14862.759 -7559.47928 19555.48
naive_model_result[[3]][,'MAE']
## Training set     Test set 
##     1359.667     2137.250
Seasonal Naive

Seasonal Naive is tricky to use when only one season has passed. Nonetheless, here are the results.

Average visitors by restaurant
snaive_models[[1]]
##       Point Forecast      Lo 80     Hi 80       Lo 95    Hi 95
## Feb 2      144.66667   73.12074 216.21260   35.246608 254.0867
## Mar 2      186.71429  115.16836 258.26022   77.294227 296.1343
## Apr 2      191.42857  119.88264 262.97450   82.008513 300.8486
## May 2      119.00000   47.45407 190.54593    9.579942 228.4201
## Jun 2      401.66667  330.12074 473.21260  292.246608 511.0867
## Jul 2      160.00000   88.45407 231.54593   50.579942 269.4201
## Aug 2        0.00000  -71.54593  71.54593 -109.420058 109.4201
## Sep 2        1.00000  -70.54593  72.54593 -108.420058 110.4201
## Oct 2       31.06897  -40.47696 102.61490  -78.351093 140.4890
## Nov 2      160.30000   88.75407 231.84593   50.879942 269.7201
## Dec 2      315.64516  244.09923 387.19109  206.225103 425.0652
## Jan 3      206.82759  135.28166 278.37352   97.407528 316.2476
## Feb 3      144.66667   43.48544 245.84789  -10.076664 299.4100
## Mar 3      186.71429   85.53306 287.89551   31.970955 341.4576
## Apr 3      191.42857   90.24735 292.60980   36.685241 346.1719
## May 3      119.00000   17.81877 220.18123  -35.743330 273.7433
## Jun 3      401.66667  300.48544 502.84789  246.923336 556.4100
## Jul 3      160.00000   58.81877 261.18123    5.256670 314.7433
## Aug 3        0.00000 -101.18123 101.18123 -154.743330 154.7433
## Sep 3        1.00000 -100.18123 102.18123 -153.743330 155.7433
## Oct 3       31.06897  -70.11226 132.25019 -123.674365 185.8123
## Nov 3      160.30000   59.11877 261.48123    5.556670 315.0433
## Dec 3      315.64516  214.46394 416.82639  160.901831 470.3885
## Jan 4      206.82759  105.64636 308.00881   52.084256 361.5709
snaive_model_result[[1]][,'MAE']
## Training set     Test set 
##     55.82759     62.61794
Unique restaurants
snaive_models[[2]]
##       Point Forecast       Lo 80    Hi 80     Lo 95    Hi 95
## Feb 2              5 -25.7572376 35.75724 -42.03914 52.03914
## Mar 2              6 -24.7572376 36.75724 -41.03914 53.03914
## Apr 2              7 -23.7572376 37.75724 -40.03914 54.03914
## May 2              6 -24.7572376 36.75724 -41.03914 53.03914
## Jun 2              3 -27.7572376 33.75724 -44.03914 50.03914
## Jul 2              7 -23.7572376 37.75724 -40.03914 54.03914
## Aug 2              0 -30.7572376 30.75724 -47.03914 47.03914
## Sep 2              1 -29.7572376 31.75724 -46.03914 48.03914
## Oct 2             28  -2.7572376 58.75724 -19.03914 75.03914
## Nov 2             30  -0.7572376 60.75724 -17.03914 77.03914
## Dec 2             31   0.2427624 61.75724 -16.03914 78.03914
## Jan 3             29  -1.7572376 59.75724 -18.03914 76.03914
## Feb 3              5 -38.4973025 48.49730 -61.52338 71.52338
## Mar 3              6 -37.4973025 49.49730 -60.52338 72.52338
## Apr 3              7 -36.4973025 50.49730 -59.52338 73.52338
## May 3              6 -37.4973025 49.49730 -60.52338 72.52338
## Jun 3              3 -40.4973025 46.49730 -63.52338 69.52338
## Jul 3              7 -36.4973025 50.49730 -59.52338 73.52338
## Aug 3              0 -43.4973025 43.49730 -66.52338 66.52338
## Sep 3              1 -42.4973025 44.49730 -65.52338 67.52338
## Oct 3             28 -15.4973025 71.49730 -38.52338 94.52338
## Nov 3             30 -13.4973025 73.49730 -36.52338 96.52338
## Dec 3             31 -12.4973025 74.49730 -35.52338 97.52338
## Jan 4             29 -14.4973025 72.49730 -37.52338 95.52338
snaive_model_result[[2]][,'MAE']
## Training set     Test set 
##           24           21
Total visitors
snaive_models[[3]]
##       Point Forecast      Lo 80     Hi 80       Lo 95     Hi 95
## Feb 2            868 -5657.6606  7393.661  -9112.1366 10848.137
## Mar 2           1307 -5218.6606  7832.661  -8673.1366 11287.137
## Apr 2           1340 -5185.6606  7865.661  -8640.1366 11320.137
## May 2            833 -5692.6606  7358.661  -9147.1366 10813.137
## Jun 2           1205 -5320.6606  7730.661  -8775.1366 11185.137
## Jul 2           1280 -5245.6606  7805.661  -8700.1366 11260.137
## Aug 2              0 -6525.6606  6525.661  -9980.1366  9980.137
## Sep 2              2 -6523.6606  6527.661  -9978.1366  9982.137
## Oct 2            901 -5624.6606  7426.661  -9079.1366 10881.137
## Nov 2           4809 -1716.6606 11334.661  -5171.1366 14789.137
## Dec 2           9785  3259.3394 16310.661   -195.1366 19765.137
## Jan 3           5998  -527.6606 12523.661  -3982.1366 15978.137
## Feb 3            868 -8360.6777 10096.678 -13246.0445 14982.045
## Mar 3           1307 -7921.6777 10535.678 -12807.0445 15421.045
## Apr 3           1340 -7888.6777 10568.678 -12774.0445 15454.045
## May 3            833 -8395.6777 10061.678 -13281.0445 14947.045
## Jun 3           1205 -8023.6777 10433.678 -12909.0445 15319.045
## Jul 3           1280 -7948.6777 10508.678 -12834.0445 15394.045
## Aug 3              0 -9228.6777  9228.678 -14114.0445 14114.045
## Sep 3              2 -9226.6777  9230.678 -14112.0445 14116.045
## Oct 3            901 -8327.6777 10129.678 -13213.0445 15015.045
## Nov 3           4809 -4419.6777 14037.678  -9305.0445 18923.045
## Dec 3           9785   556.3223 19013.678  -4329.0445 23899.045
## Jan 4           5998 -3230.6777 15226.678  -8116.0445 20112.045
snaive_model_result[[3]][,'MAE']
## Training set     Test set 
##      5092.00      4080.75
TBATS

TBATS is the combination of Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend, and Seasonal components.

This methodology tends to perform well when there is more than one regime in a time series, something present in all three of these series.

Average visitors by restaurant
tbats_models[[1]]
## TBATS(1, {1,0}, 1, {<12,5>})
## 
## Call: tbats(y = subseries)
## 
## Parameters
##   Alpha: -0.02011887
##   Beta: 0.0001004308
##   Damping Parameter: 1
##   Gamma-1 Values: 1.445945e-05
##   Gamma-2 Values: 0.0002511197
##   AR coefficients: -0.983529
## 
## Seed States:
##            [,1]
##  [1,] 124.21924
##  [2,]   4.75464
##  [3,]  28.59260
##  [4,]  50.28093
##  [5,] -36.73593
##  [6,] -19.23905
##  [7,]  11.54502
##  [8,]  84.15764
##  [9,] -87.81545
## [10,]  16.37373
## [11,] -66.74894
## [12,]  26.35279
## [13,]   0.00000
## 
## Sigma: 5.515416
## AIC: 113.7406
tbats_model_result[[1]][,'MAE']
## Training set     Test set 
##     4.058811    62.658474
Unique restaurants
tbats_models[[2]]
## TBATS(1, {1,0}, 1, {<12,5>})
## 
## Call: tbats(y = subseries)
## 
## Parameters
##   Alpha: -0.0201189
##   Beta: 0.000100343
##   Damping Parameter: 1
##   Gamma-1 Values: 1.445892e-05
##   Gamma-2 Values: 0.0002511196
##   AR coefficients: -0.983523
## 
## Seed States:
##             [,1]
##  [1,] -2.2688061
##  [2,]  2.0025719
##  [3,]  8.7976634
##  [4,] -2.2178185
##  [5,] -3.2182471
##  [6,]  2.9482767
##  [7,] -0.7315858
##  [8,] -0.6949788
##  [9,] -4.9074773
## [10,]  1.6666667
## [11,]  1.1547005
## [12,] -2.1383545
## [13,]  0.0000000
## 
## Sigma: 0.1386042
## AIC: 17.96488
tbats_model_result[[2]][,'MAE']
## Training set     Test set 
##    0.1019991    3.6632624
Total visitors
tbats_models[[3]]
## TBATS(1, {1,0}, 1, {<12,5>})
## 
## Call: tbats(y = subseries)
## 
## Parameters
##   Alpha: -0.02011887
##   Beta: 0.0001004308
##   Damping Parameter: 1
##   Gamma-1 Values: 1.445945e-05
##   Gamma-2 Values: 0.0002511197
##   AR coefficients: -0.983529
## 
## Seed States:
##              [,1]
##  [1,]  -828.32414
##  [2,]   425.22100
##  [3,]  2148.93838
##  [4,]   806.42866
##  [5,]  -535.96929
##  [6,]  -423.10193
##  [7,]  -579.48928
##  [8,]   628.60792
##  [9,] -1351.43264
## [10,]  -934.16667
## [11,]  -590.58121
## [12,]   -70.27458
## [13,]     0.00000
## 
## Sigma: 47.83824
## AIC: 169.9078
tbats_model_result[[3]][,'MAE']
## Training set     Test set 
##     35.20431   2137.10158

Method selection, challenging models

After the previous analysis we discovered that:

  * TBATS had the best performance for average monthly visitors. Even though it overfit, it had the lowest MAE against the test set. Future analysis would be required to maintain the forecast.
  * Naive had a good result on unique restaurants. It's a simple method, but when a series is erratic it is a good and easy approach to forecasting.

Then I evaluated which approach is more accurate. I built an ensemble model by multiplying the average-visitors and unique-restaurants forecasts, and compared it against the direct total-visitors forecast. I selected the method with the least MAE against the test set.

mae_ttl_visitors
## total_visitors_ensemble_mae    total_visitors_tbats_mae 
##                    1986.521                    2137.102

The ensemble model is more accurate, with a mean absolute error of 1986. That means that, on average, each month's forecast will be off by about ±1986 total visitors.
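As a sketch of how the two candidates are scored, with hypothetical test-set numbers rather than the actual series: MAE is the mean of absolute errors, and the ensemble forecast of total visitors is the product of the two component forecasts.

```python
# Hypothetical test-set values (illustrative only, not the challenge data)
actual_total = [5998, 4809, 9785]

avg_visitors_fcst = [160.0, 150.0, 200.0]   # avg visitors per restaurant (TBATS-style)
restaurants_fcst  = [29, 29, 29]            # unique restaurants (naive-style)

# Ensemble: total visitors = avg visitors per restaurant * number of restaurants
ensemble_fcst = [a * r for a, r in zip(avg_visitors_fcst, restaurants_fcst)]

def mae(actual, forecast):
    """Mean absolute error between two equal-length series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

print(ensemble_fcst)                        # [4640.0, 4350.0, 5800.0]
print(mae(actual_total, ensemble_fcst))     # 1934.0
```

Whichever candidate scores the lower MAE on the held-out weeks wins the challenge; multiplying the components also keeps the forecast actionable, since each factor maps to a lever (acquire restaurants vs. grow visits per restaurant).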

Once the method was selected, I updated the Naive and TBATS models so they included the latest data. With that I forecasted from June to November.

Question 5

Based on the data and your ideas, plan strategies to double the total restaurant visitors in six months.

Strategies that in total could double the amount of visitors

Strategy 1

  1. Add more restaurants in cities with high population density. There are multiple cities with high population density but a low number of restaurants.
restaurants_position

Below is a Japan population density plot (Japan population density).

Strategy 2

  1. Add more food genres. There was a huge increase after more genres were implemented; Izakaya had a great impact on visitors. I would suggest exploring a healthy or international food segment.
visitors_by_genre.barplot

Strategy 3

  1. Send reminders to reserve a visit to a favorite place, or suggest other options. Fridays and Saturdays have a higher number of days between reservation and visit, so it could help to send reminders to reserve a restaurant. And if the restaurant is already fully booked, we can suggest a similar place with available seats.
days_from_reserve_to_visit.boxplot

Note: the view is zoomed in to where the boxplots are located, since the many outliers would dominate the zoomed-out boxplot graph.

Strategy 4

  1. Gather more data. Lastly, the forecast suggests that, perhaps without any additional effort, next month we may double the number of visitors of last month. So it would be a great time to gather more data and more variables.

Question 6

Imagine that these restaurants are in your city (and not in Japan), what other data would you want to join in order of get more insights to increase the visitors?

User information.

This is crucial to understand what drives the user to reserve or purchase anything. Age, consumption habits, etc.

App usage information is extremely important.

Maybe there’s a high traffic of requests to visit a place but something in the funnel makes the user drop the reservation.

Prices of the places and more information about the restaurants are crucial.

There could be the opportunity to understand demand elasticity given different prices. I’ve stopped placing orders in some restaurants because the price increased a lot.

Challenge 2

Question 1

How many active users and new users do we have for each week of November 2019 to February 2020?

Definitions:

  * New user: count of unique user ids at each user's minimum order_info.order_date.
  * Active user: count of unique user ids with an order, grouped by week of order_info.order_date.

WITH 
new_users as (
SELECT 
user_id, 
min(order_date) as order_date
FROM order_info oi
GROUP BY user_id
),
weekly_new_users as (
select 
DATE_SUB(order_date, 
    INTERVAL DAYOFWEEK(order_date)-2 DAY) as order_week, 
count(distinct user_id) as new_users
from new_users
group by order_week
),
weekly_unique_active_users as (
SELECT 
DATE_SUB(order_date, 
    INTERVAL DAYOFWEEK(order_date)-2 DAY) as order_week, 
count(distinct user_id) as active_users
FROM order_info oi
GROUP BY order_week
)

SELECT *
FROM weekly_unique_active_users wau 
LEFT JOIN weekly_new_users wnu USING (order_week)
WHERE order_week between '2019-11-01' and '2020-02-29'
ORDER BY order_week
Displaying records 1 - 10

| order_week | active_users | new_users |
|---|---|---|
| 2019-11-04 | 7210 | 7009 |
| 2019-11-11 | 8651 | 7544 |
| 2019-11-18 | 10002 | 8233 |
| 2019-11-25 | 10905 | 8676 |
| 2019-12-02 | 12140 | 9458 |
| 2019-12-09 | 12665 | 9161 |
| 2019-12-16 | 11056 | 6730 |
| 2019-12-23 | 8207 | 4638 |
| 2019-12-30 | 8318 | 4730 |
| 2020-01-06 | 9774 | 5614 |
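The week bucketing used throughout these queries, DATE_SUB(order_date, INTERVAL DAYOFWEEK(order_date)-2 DAY), maps each date to its week's Monday. A Python sketch of the same arithmetic makes one quirk visible: because MySQL's DAYOFWEEK numbers Sunday as 1, Sundays land in the following Monday's bucket.

```python
from datetime import date, timedelta

def order_week(d):
    """Mirror DATE_SUB(d, INTERVAL DAYOFWEEK(d)-2 DAY).

    MySQL DAYOFWEEK: 1=Sunday .. 7=Saturday; Python isoweekday: 1=Monday .. 7=Sunday.
    """
    dayofweek = d.isoweekday() % 7 + 1
    return d - timedelta(days=dayofweek - 2)

print(order_week(date(2019, 11, 6)))   # Wednesday -> 2019-11-04 (its Monday)
print(order_week(date(2019, 11, 10)))  # Sunday    -> 2019-11-11 (the NEXT Monday)
```

Whether assigning Sunday orders to the following week is intended is worth confirming; a Monday-to-Sunday week would instead subtract DAYOFWEEK-2 modulo 7.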

For question 4

new_and_active_weekly_users.lineplot

Question 2

How many reengaged users do we have (Reengaged: active this week that didn’t have an order last week but they did before that) for each week of November 2019 to February 2020?

Reengaged Users metric definition: Count of unique user ids where the number of weeks between current order date and previous one is greater than 1.

WITH 
user_by_week AS (
SELECT 
*,
lag(order_week, 1) OVER (PARTITION BY user_id ORDER BY order_week) previous_order_week
FROM
(SELECT 
user_id,
DATE_SUB(order_date, 
    INTERVAL DAYOFWEEK(order_date)-2 DAY) as order_week
FROM order_info oi
GROUP BY user_id, order_week
ORDER BY user_id, order_week) upw
),
reengaged_users AS (
SELECT *, floor(DATEDIFF(order_week, previous_order_week)/7) as inactive_weeks
FROM user_by_week ubw
WHERE floor(DATEDIFF(order_week, previous_order_week)/7) > 1
)

SELECT order_week, count(distinct user_id) as reengaged_users
FROM reengaged_users ru
WHERE order_week between '2019-11-01' and '2020-02-29'
GROUP BY order_week
ORDER BY order_week
Displaying records 1 - 10

| order_week | reengaged_users |
|---|---|
| 2019-11-11 | 54 |
| 2019-11-18 | 546 |
| 2019-11-25 | 821 |
| 2019-12-02 | 1254 |
| 2019-12-09 | 1704 |
| 2019-12-16 | 2241 |
| 2019-12-23 | 1955 |
| 2019-12-30 | 2207 |
| 2020-01-06 | 2727 |
| 2020-01-13 | 2633 |
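The reengagement rule from the query, floor(DATEDIFF(order_week, previous_order_week) / 7) > 1, can be sketched as a small Python predicate:

```python
from datetime import date

def is_reengaged(order_week, previous_order_week):
    """True when the user was active before but skipped at least one full week,
    mirroring floor(DATEDIFF(order_week, previous_order_week) / 7) > 1."""
    if previous_order_week is None:   # new user: no prior active week
        return False
    inactive_weeks = (order_week - previous_order_week).days // 7
    return inactive_weeks > 1

print(is_reengaged(date(2019, 11, 25), date(2019, 11, 18)))  # False: consecutive weeks
print(is_reengaged(date(2019, 11, 25), date(2019, 11, 4)))   # True: skipped two weeks
```

A gap of exactly one week (7 days) gives inactive_weeks = 1, so users active in adjacent weeks are plain active users, matching the "didn't have an order last week but they did before that" definition.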

For question 4

weekly_reengaged_users.lineplot

Question 3

What’s the average GMV by type of user (Active, new, reengaged) for each week of November 2019 to February 2020?

IMPORTANT: The definitions of the three user types are not mutually exclusive; new users and reengaged users are in fact also active users.

WITH 
user_by_week AS (
SELECT 
*,
lag(order_week, 1) OVER (PARTITION BY user_id ORDER BY order_week) previous_order_week
FROM
(SELECT 
user_id,
DATE_SUB(order_date, 
    INTERVAL DAYOFWEEK(order_date)-2 DAY) as order_week
FROM order_info oi
GROUP BY user_id, order_week
ORDER BY user_id, order_week) upw
),
reengaged_users AS (
SELECT *, floor(DATEDIFF(order_week, previous_order_week)/7) as inactive_weeks
FROM user_by_week ubw
WHERE floor(DATEDIFF(order_week, previous_order_week)/7) > 1
),
weekly_reengaged_users AS (
SELECT order_week, user_id
FROM reengaged_users ru
GROUP BY order_week, user_id
ORDER BY order_week, user_id
),
reengaged_users_orders AS (
SELECT 
wru.order_week, 
avg(gmv) as avg_gmv
FROM weekly_reengaged_users wru
LEFT JOIN order_info oi ON 
    (wru.user_id = oi.user_id AND
        DATE_SUB(oi.order_date, 
            INTERVAL DAYOFWEEK(order_date)-2 DAY) =
        wru.order_week)
GROUP BY wru.order_week
),
new_users_orders AS (
SELECT 
DATE_SUB(oi.order_date, 
    INTERVAL DAYOFWEEK(order_date)-2 DAY) as order_week, 
avg(gmv) as avg_gmv
FROM user_info ui
LEFT JOIN order_info oi ON 
    (ui.first_order = oi.order_id AND oi.user_id = ui.user_id)
GROUP BY order_week
),
active_users_orders as (
SELECT 
DATE_SUB(order_date, 
    INTERVAL DAYOFWEEK(order_date)-2 DAY) as order_week, 
avg(gmv) as avg_gmv
FROM order_info oi
GROUP BY order_week
)

SELECT *, 'New User' AS user_type
FROM new_users_orders nuo
WHERE nuo.order_week between '2019-11-01' and '2020-02-29'
UNION
SELECT *, 'Active User' AS user_type
FROM active_users_orders auo
WHERE auo.order_week between '2019-11-01' and '2020-02-29'
UNION
SELECT *, 'Reengaged User' AS user_type
FROM reengaged_users_orders ruo
WHERE ruo.order_week between '2019-11-01' and '2020-02-29'
Displaying records 1 - 10

| order_week | avg_gmv | user_type |
|---|---|---|
| 2019-11-18 | 161.1856 | New User |
| 2019-11-25 | 158.0916 | New User |
| 2020-01-06 | 164.6383 | New User |
| 2019-12-16 | 170.8322 | New User |
| 2020-01-20 | 152.4472 | New User |
| 2020-02-24 | 159.7361 | New User |
| 2019-11-04 | 139.1320 | New User |
| 2020-02-10 | 166.2084 | New User |
| 2020-02-17 | 164.1437 | New User |
| 2019-12-09 | 166.3906 | New User |

For question 4

avg_gmv_per_user_type.lineplot

Question 4

On your preferred tool (Excel, Python, R, etc.) please create charts for each of your results.

I placed the graphs within each question.

Question 5

Based on the charts give your opinion/recommendations regarding to the different type of users.

Recommendation 1

  1. Create segments for recurring users based on the number of orders they've made.

It's important to have order_date as a timestamp. It allows you to understand the number of orders users make before churning and the level of engagement they have with your application.

avg_gmv_by_order_n.lineplot

Order number definition: the cumulative count of orders per user_id.

Recommendation 2

  1. A high percentage of users has only a single order. If DiDi has a high churn rate, it won't be able to sustain growth. There should be more analysis into why this is happening.

inactive_users.sankeyplot

Recommendation 3

  1. There should be experiments with behavioral economics to accelerate engagement and ordering. As the number of orders increases, the number of days until the next order decreases.
days_to_recur.boxplot